Breaking down EDA for a set of 4 datasets¶

In [1]:
import numpy as np
import pandas as pd

from IPython.display import display, display_html, HTML

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split

from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, roc_curve
from sklearn.model_selection import learning_curve, cross_val_score, GridSearchCV

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler,StandardScaler,MinMaxScaler

import warnings
warnings.filterwarnings('ignore')
In [2]:
characteristics = pd.read_csv('../input/2019-database-of-road-traffic-injuries/caracteristiques-2019.csv')
characteristics.name = 'characteristics'
places = pd.read_csv('../input/2019-database-of-road-traffic-injuries/lieux-2019.csv')
places.name = 'places'
drivers = pd.read_csv('../input/2019-database-of-road-traffic-injuries/usagers-2019.csv')
drivers.name = 'drivers'
vehicles = pd.read_csv('../input/2019-database-of-road-traffic-injuries/vehicules-2019.csv')
vehicles.name = 'vehicles'

datasets = [characteristics,places,vehicles,drivers]

characteristics = characteristics.set_index('Num_Acc')
places = places.set_index('Num_Acc')
vehicles = vehicles.set_index('id_vehicule')
drivers = drivers.set_index('id_vehicule')

pd.set_option('display.max_rows',max(characteristics.shape[0],places.shape[0],drivers.shape[0],vehicles.shape[0]))
pd.set_option('display.max_columns',max(characteristics.shape[1],places.shape[1],drivers.shape[1],vehicles.shape[1]))

for df in datasets:
    print(f"The dataset {df.name} has {df.shape[0]} rows and {df.shape[1]} columns")
The dataset characteristics has 58840 rows and 15 columns
The dataset places has 58840 rows and 18 columns
The dataset vehicles has 100710 rows and 11 columns
The dataset drivers has 132977 rows and 15 columns
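The four tables link on two keys: `Num_Acc` ties characteristics and places to an accident, while `id_vehicule` ties drivers (users) to vehicles. A minimal sketch of how the tables could be joined, using toy frames that mirror the real keys rather than the actual CSVs:

```python
import pandas as pd

# Toy frames mirroring the real keys: Num_Acc links an accident to its context,
# id_vehicule links each vehicle to the users it carried.
chars = pd.DataFrame({'Num_Acc': [1, 2], 'lum': [1, 3]})
vehs = pd.DataFrame({'Num_Acc': [1, 1, 2],
                     'id_vehicule': [10, 11, 12],
                     'catv': [7, 17, 7]})
users = pd.DataFrame({'id_vehicule': [10, 10, 11, 12],
                      'grav': [1, 4, 2, 1]})

# One row per user, enriched with its vehicle and accident context
full = users.merge(vehs, on='id_vehicule').merge(chars, on='Num_Acc')
print(full.shape)  # (4, 5)
```

With the real data the same two merges would produce one row per user with the accident-level features attached.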
In [3]:
display(HTML('<h1>characteristics</h1>'))
display(characteristics.head())
display(HTML('<h1>vehicles</h1>'))
display(vehicles.head())
display(HTML('<h1>drivers</h1>'))
display(drivers.head())
display(HTML('<h1>places</h1>'))
display(places.head())

characteristics

jour mois an hrmn lum dep com agg int atm col adr lat long
Num_Acc
201900000001 30.0 11.0 2019.0 0.062500 4.0 93 93053 1.0 1.0 1.0 2.0 AUTOROUTE A3 488962100.0 24701200.0
201900000002 30.0 11.0 2019.0 0.118056 3.0 93 93066 1.0 1.0 1.0 6.0 AUTOROUTE A1 489307000.0 23688000.0
201900000003 28.0 11.0 2019.0 0.635417 1.0 92 92036 1.0 1.0 1.0 4.0 AUTOROUTE A86 489358718.0 23191744.0
201900000004 30.0 11.0 2019.0 0.847222 5.0 94 94069 1.0 1.0 1.0 4.0 A4 488173295.0 24281502.0
201900000005 30.0 11.0 2019.0 0.166667 3.0 94 94028 1.0 1.0 1.0 2.0 A86 INT 487763620.0 24332540.0

vehicles

Num_Acc num_veh senc catv obs obsm choc manv motor occutc
id_vehicule
138 306 524 201900000001 B01 2 7 0 2 5 23 1 NaN
138 306 525 201900000001 A01 2 17 1 0 3 11 1 NaN
138 306 523 201900000002 A01 1 7 4 0 1 0 1 NaN
138 306 520 201900000003 A01 1 7 0 2 1 2 1 NaN
138 306 521 201900000003 B01 1 7 1 0 4 2 1 NaN

drivers

Num_Acc num_veh place catu grav sexe an_nais trajet secu1 secu2 secu3 locp actp etatp
id_vehicule
138 306 524 201900000001 B01 2 2 4 2 2002 0 1 0 -1 -1 -1 -1
138 306 524 201900000001 B01 1 1 4 2 1993 5 1 0 -1 -1 -1 -1
138 306 525 201900000001 A01 1 1 1 1 1959 0 1 0 -1 -1 -1 -1
138 306 523 201900000002 A01 1 1 4 2 1994 0 1 0 -1 -1 -1 -1
138 306 520 201900000003 A01 1 1 1 1 1996 0 1 0 -1 -1 0 -1

places

catr voie v1 v2 circ nbv vosp prof pr pr1 plan lartpc larrout surf infra situ vma
Num_Acc
201900000001 1.0 3 0.0 NaN 3.0 10.0 0.0 1.0 6.0 900.0 2.0 NaN NaN 1.0 2.0 1.0 70.0
201900000002 1.0 1 0.0 NaN 1.0 2.0 0.0 4.0 3.0 845.0 2.0 NaN NaN 1.0 0.0 1.0 70.0
201900000003 1.0 86 0.0 NaN 3.0 8.0 0.0 1.0 10.0 500.0 3.0 NaN NaN 1.0 0.0 1.0 90.0
201900000004 1.0 4 0.0 NaN 3.0 5.0 0.0 1.0 2.0 299.0 1.0 NaN NaN 1.0 0.0 1.0 90.0
201900000005 1.0 86 0.0 INT 1.0 3.0 0.0 1.0 41.0 0.0 3.0 NaN NaN 1.0 2.0 1.0 90.0

Features description¶

Num_Acc¶

Identification number of the accident.

jour¶

Day of the accident.

mois¶

Month of the accident.

an¶

Year of accident.

hrmn¶

Hour and minutes of the accident. This one is tricky: the raw value is stored as a fraction of a 24-hour day (e.g. 0.0625 = 01:30), so it needs converting back into a time of day.
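The fraction-of-a-day encoding can be undone with a small helper (a sketch; `frac_to_hhmm` is a name of my choosing, not part of the dataset):

```python
def frac_to_hhmm(frac):
    """Turn a fraction of a 24h day (the raw hrmn encoding) into 'HH:MM'."""
    total_min = round(frac * 24 * 60)  # fraction of a day -> minutes since midnight
    return f"{total_min // 60:02d}:{total_min % 60:02d}"

print(frac_to_hhmm(0.062500))  # 01:30
print(frac_to_hhmm(0.635417))  # 15:15
```

The two sample values are the ones visible in the `characteristics` head above.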

lum¶

Light: lighting conditions in which the accident occurred:
  1. Full day
  2. Twilight or dawn
  3. Night without public lighting
  4. Night with public lighting not on
  5. Night with public lighting on

dep¶

Department: INSEE code (National Institute of Statistics and Economic Studies) of the department (2A Corse-du-Sud, 2B Haute-Corse).

com¶

Municipality: the municipality number is an INSEE code, made up of the department code followed by 3 digits.
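Since com is the department code followed by 3 digits, the department can be recovered by stripping the last three characters. A small sketch with toy commune codes (including a Corsican one):

```python
import pandas as pd

com = pd.Series(['93053', '92036', '2B033'])  # toy commune codes
dep = com.str[:-3]                            # drop the 3-digit commune suffix
print(dep.tolist())  # ['93', '92', '2B']
```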

agg¶

Location :
  1. Outside agglomeration
  2. In built-up areas

int¶

Intersection:
  1. Not at an intersection
  2. Intersection in X
  3. T-intersection
  4. Y intersection
  5. Intersection with more than 4 branches
  6. Roundabout
  7. Square
  8. Level crossing
  9. Other intersection

atm¶

Atmospheric conditions:

-1. Not specified

  1. Normal
  2. Light rain
  3. Heavy rain
  4. Snow, hail
  5. Fog, smoke
  6. Strong wind, storm
  7. Dazzling weather
  8. Cloudy weather
  9. Other

col¶

Collision type:

-1. Not specified

  1. Two vehicles, head-on
  2. Two vehicles, from the rear
  3. Two vehicles, from the side
  4. Three or more vehicles, in a chain
  5. Three or more vehicles, multiple collisions
  6. Other collision
  7. No collision

adr¶

Postal address: variable entered for accidents occurring in built-up areas.

lat¶

Latitude

long¶

Longitude

catr¶

Road category:
  1. Highway
  2. National road
  3. Departmental road
  4. Communal roads
  5. Outside the public network
  6. Parking lot open to public traffic
  7. Urban metropolis roads
  9. Other

voie¶

Route number.

v1¶

Numerical index of the road number (example: 2 bis, 3 ter etc.).

v2¶

Alphanumeric road index letter.

circ¶

Traffic regime:

-1. Not specified

  1. One way
  2. Bidirectional
  3. Separate carriageways
  4. Variable-assignment lanes

nbv¶

Total number of traffic lanes.

vosp¶

Indicates the existence of a reserved lane, regardless of whether or not the accident took place on that lane.

  -1. Not specified
  0. Not applicable

  1. Cycle path
  2. Cycle lane
  3. Reserved lane

prof¶

Longitudinal profile describes the gradient of the road at the location of the accident:

-1. Not specified

  1. Flat
  2. Slope
  3. Hill crest
  4. Hill bottom

pr¶

Number of the associated PR (number of the upstream terminal). The value -1 means that the PR is not informed.

pr1¶

Distance in meters from the PR (in relation to the upstream terminal). The value -1 means that the PR is not informed.

plan¶

Plan layout:

-1. Not specified

  1. Straight section
  2. Curve to the left
  3. Curve to the right
  4. S-curve

lartpc¶

Width of the central reservation (TPC) if it exists (in m).

larrout¶

Width of the roadway used for vehicle traffic, excluding emergency stopping strips, the central reservation (TPC) and parking spaces (in m).

surf¶

Surface condition: 

-1. Not specified

  1. Normal
  2. Wet
  3. Puddles
  4. Flooded
  5. Snowy
  6. Mud
  7. Icy
  8. Grease, oil
  9. Other

infra¶

Development. Infrastructure:

  -1. Not specified
  0. None

  1. Underground, tunnel
  2. Bridge, flyover
  3. Interchange or connection ramp
  4. Railroad
  5. Developed crossroads
  6. Pedestrian zone
  7. Toll zone
  8. Work zone
  9. Other

situ¶

Situation of the accident:

  -1. Not specified
  0. None

  1. On the road
  2. On emergency lane
  3. On the shoulder
  4. On the sidewalk
  5. On a cycle path
  6. On other special track
  7. Others

vma¶

Maximum authorized speed at the scene and at the time of the accident.

vehicle_id¶

Unique identifier of the vehicle used for each user occupying this vehicle (including pedestrians who are attached to the vehicles which collided with them). Numerical code.

Num_Veh¶

Identifier of the vehicle taken back for each of the users occupying this vehicle (including pedestrians who are attached to the vehicles which collided with them). Alphanumeric code.

senc¶

Flow direction :

  -1. Not specified
  0. Unknown

  1. Increasing PK or PR or postal address number
  2. Decreasing PK or PR or postal address number
  3. No reference point

catv¶

Vehicle category:
  1. Not determinable
  2. Bicycle
  3. Moped < 50 cm3
  4. Microcar (motor quadricycle with bodywork) (formerly "cart or motor tricycle")
  5. Reference unused since 2006 (registered scooter)
  6. Reference unused since 2006 (motorcycle)
  7. Reference unused since 2006 (sidecar)
  8. Light vehicle (VL) only
  9. Reference unused since 2006 (VL + caravan)
  10. Reference unused since 2006 (VL + trailer)
  11. Utility vehicle (VU) only, 1.5T <= GVW <= 3.5T, with or without trailer
  12. Reference unused since 2006 (VU + caravan)
  13. Reference unused since 2006 (VU + trailer)
  14. Heavy goods vehicle (PL) only, 3.5T < GVW <= 7.5T
  15. Heavy goods vehicle (PL) only, > 7.5T
  16. Heavy goods vehicle (PL) > 3.5T + trailer
  17. Road tractor only
  18. Road tractor + semi-trailer
  19. Reference unused since 2006 (public transport)
  20. Reference unused since 2006 (tram)
  21. Special equipment
  22. Farm tractor
  23. Scooter < 50 cm3
  24. Motorcycle > 50 cm3 and <= 125 cm3
  25. Scooter > 50 cm3 and <= 125 cm3
  26. Motorcycle > 125 cm3
  27. Scooter > 125 cm3
  28. Light quad <= 50 cm3 (quadricycle without bodywork)
  29. Heavy quad > 50 cm3 (quadricycle without bodywork)
  30. Bus
  31. Coach
  32. Train
  33. Tram
  34. 3WD <= 50 cm3
  35. 3WD > 50 cm3 and <= 125 cm3
  36. 3WD > 125 cm3
  37. Motorized personal mobility device (EDP)
  38. Non-motorized personal mobility device (EDP)
  39. Electrically assisted bicycle (VAE)
  40. Other vehicle

obs¶

Fixed obstacle struck:

  -1. Not specified
  0. Not applicable
  1. Parked vehicle
  2. Tree
  3. Metal guardrail
  4. Concrete guardrail
  5. Other guardrail
  6. Building, wall, bridge pier
  7. Vertical signage support or emergency call station
  8. Post
  9. Street furniture
  10. Parapet
  11. Island, refuge, bollard
  12. Sidewalk edge
  13. Ditch, embankment, rock face
  14. Other fixed obstacle on the roadway
  15. Other fixed obstacle on sidewalk or shoulder
  16. Ran off the roadway without obstacle
  17. Culvert, aqueduct head

obsm¶

Movable obstacle struck:

  -1. Not specified
  0. None

  1. Pedestrian
  2. Vehicle
  3. Rail vehicle
  4. Domestic animal
  5. Wild animal
  6. Other

choc¶

Initial point of impact:

  -1. Not specified
  0. None
  1. Front
  2. Front right
  3. Front left
  4. Rear
  5. Rear right
  6. Rear left
  7. Right side
  8. Left side
  9. Multiple impacts (rollover)

manv¶

Main maneuver before the accident:

  -1. Not specified
  0. Unknown
  1. Without change of direction
  2. Same direction, same lane
  3. Between two lanes
  4. In reverse
  5. Against the traffic
  6. Crossing the central reservation
  7. In the bus lane, same direction
  8. In the bus lane, opposite direction
  9. Merging
  10. Making a U-turn on the roadway
  11. Changing lane to the left
  12. Changing lane to the right
  13. Drifting to the left
  14. Drifting to the right
  15. Turning left
  16. Turning right
  17. Overtaking on the left
  18. Overtaking on the right
  19. Crossing the roadway
  20. Parking maneuver
  21. Avoidance maneuver
  22. Door opening
  23. Stopped (not parking)
  24. Parked (with occupants)
  25. Driving on the sidewalk
  26. Other maneuvers

motor¶

Vehicle engine type:

  -1. Not specified
  0. Unknown

  1. Hydrocarbons
  2. Electric hybrid
  3. Electric
  4. Hydrogen
  5. Human
  6. Other

occutc¶

Number of occupants in public transport.

id_vehicule¶

Unique identifier of the vehicle used for each user occupying this vehicle (including pedestrians who are attached to the vehicles which collided with them). Numerical code.

Num_Veh¶

Identifier of the vehicle taken back for each of the users occupying this vehicle (including pedestrians who are attached to the vehicles which collided with them). Alphanumeric code.

place¶

Used to locate the seat occupied in the vehicle by the user at the time of the accident. See this link for the layout: https://ibb.co/NsTxbXP

catu¶

User category:
  1. Driver
  2. Passenger
  3. Pedestrian

grav¶

Severity of the user's injury; injured users are classified into three categories of victims, plus unharmed:
  1. Unharmed
  2. Killed
  3. Injured hospitalized
  4. Slightly injured

sexe¶

Driver gender:
  1. Male
  2. Female

an_nais¶

Year of birth of the driver

trajet¶

Reason for travel at the time of the accident:

  -1. Not specified
  0. Not specified
  1. Home to work
  2. Home to school
  3. Shopping
  4. Professional use
  5. Walk, leisure
  6. Other

Until 2018, safety equipment was split into two variables: existence and use. From 2019, only use is recorded, with up to 3 devices per user (especially for motorcyclists, whose helmet and gloves are compulsory).

secu1¶

This field indicates the presence and use of safety equipment:

  -1. Not specified
  0. No equipment

  1. Belt
  2. Helmet
  3. Children's device
  4. reflective vest
  5. Airbag (2WD / 3WD)
  6. Gloves (2WD / 3WD)
  7. Gloves + Airbag (2WD / 3WD)
  8. Not determinable
  9. Other

secu2¶

This field indicates the presence and use of safety equipment:

  -1. Not specified
  0. No equipment

  1. Belt
  2. Helmet
  3. Children's device
  4. reflective vest
  5. Airbag (2WD / 3WD)
  6. Gloves (2WD / 3WD)
  7. Gloves + Airbag (2WD / 3WD)
  8. Not determinable
  9. Other

secu3¶

This field indicates the presence and use of safety equipment:

  -1. Not specified
  0. No equipment

  1. Belt
  2. Helmet
  3. Children's device
  4. reflective vest
  5. Airbag (2WD / 3WD)
  6. Gloves (2WD / 3WD)
  7. Gloves + Airbag (2WD / 3WD)
  8. Not determinable
  9. Other

locp¶

Pedestrian location:

  -1. Not specified
  0. Not applicable
  On the carriageway:
  1. More than 50 m from a pedestrian crossing
  2. Less than 50 m from a pedestrian crossing
  On a pedestrian crossing:
  3. Without light signals
  4. With light signals
  Various:
  5. On the sidewalk
  6. On the shoulder
  7. On a refuge island or emergency lane
  8. On a service road
  9. Unknown

actp¶

Pedestrian action:

  -1. Not specified
  0. Not specified or not applicable
  Moving:
  1. In the direction of the striking vehicle
  2. In the opposite direction of the vehicle
  Various:
  3. Crossing
  4. Masked
  5. Playing, running
  6. With an animal
  7. Other
  A. Getting on or off the vehicle
  B. Unknown

etatp¶

This variable is used to specify whether the injured pedestrian was alone or not:

-1. Not specified

  1. Alone
  2. Accompanied
  3. In a group
In [4]:
display(HTML('<h1><center>Missing values of the different tables (%)</center></h1>'))

# Missing rate (%) per column, one small frame per table
a = (characteristics.isna().mean() * 100).round(2).rename_axis('features').reset_index(name='missing_rate')
b = (vehicles.isna().mean() * 100).round(2).rename_axis('features').reset_index(name='missing_rate')
c = (places.isna().mean() * 100).round(2).rename_axis('features').reset_index(name='missing_rate')
d = (drivers.isna().mean() * 100).round(2).rename_axis('features').reset_index(name='missing_rate')

def highlight_greaterthan(x):
    # Color a row according to its missing-value rate
    if x.missing_rate > 80:
        return ['background-color: #FFCECE']*2
    elif x.missing_rate > 40:
        return ['background-color: #FFE9CE']*2
    elif x.missing_rate > 5:
        return ['background-color: #FFFECE']*2
    else:
        return ['background-color: #CEFFFC']*2
    
caption_style = [{
    'selector': 'caption',
    'props': [
        ('color', '#585858'),
        ('font-size', '30px')
    ]
}]

def style_missing(df):
    # Same highlighting and caption styling for each of the four tables
    return df.style.apply(highlight_greaterthan, axis=1).set_table_styles(caption_style)

a = style_missing(a)
b = style_missing(b)
c = style_missing(c)
d = style_missing(d)

a_styler = a.set_table_attributes("style='display:inline'").set_caption('characteristics')
b_styler = b.set_table_attributes("style='display:inline'").set_caption('vehicles')
c_styler = c.set_table_attributes("style='display:inline'").set_caption('places')
d_styler = d.set_table_attributes("style='display:inline'").set_caption('drivers')

space = "\xa0" * 50
display_html(a_styler._repr_html_() + space + b_styler._repr_html_() + space + c_styler._repr_html_() + space + d_styler._repr_html_(), raw=True)

display(HTML('<h3><i>The values highlighted are the ones above a certain threshold of missing values</i></h3>'))
display(HTML('<h3><i>We will get rid of those for the rest of the notebook</i></h3>'))

Missing values of the different tables (%)

characteristics
features missing_rate
0 jour 0.220000
1 mois 0.220000
2 an 0.220000
3 hrmn 0.220000
4 lum 0.220000
5 dep 0.220000
6 com 0.220000
7 agg 0.220000
8 int 0.220000
9 atm 0.220000
10 col 0.220000
11 adr 0.940000
12 lat 0.220000
13 long 0.220000
                                                  
vehicles
features missing_rate
0 Num_Acc 0.000000
1 num_veh 0.000000
2 senc 0.000000
3 catv 0.000000
4 obs 0.000000
5 obsm 0.000000
6 choc 0.000000
7 manv 0.000000
8 motor 0.000000
9 occutc 99.110000
                                                  
places
features missing_rate
0 catr 0.040000
1 voie 5.030000
2 v1 18.310000
3 v2 92.920000
4 circ 5.390000
5 nbv 1.150000
6 vosp 1.180000
7 prof 0.070000
8 pr 12.450000
9 pr1 12.960000
10 plan 0.060000
11 lartpc 99.640000
12 larrout 99.370000
13 surf 0.080000
14 infra 0.120000
15 situ 0.240000
16 vma 1.540000
                                                  
drivers
features missing_rate
0 Num_Acc 0.000000
1 num_veh 0.000000
2 place 0.000000
3 catu 0.000000
4 grav 0.000000
5 sexe 0.000000
6 an_nais 0.000000
7 trajet 0.000000
8 secu1 0.000000
9 secu2 0.000000
10 secu3 0.000000
11 locp 0.000000
12 actp 0.000000
13 etatp 0.000000

The values highlighted are the ones above a certain threshold of missing values

We will get rid of those for the rest of the notebook
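Dropping the columns above the threshold can be done programmatically rather than listing them by hand. A sketch with a toy frame, using the same 80% cut-off as the strongest highlight above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [np.nan] * 4,                   # 100% missing
                   'c': [1, np.nan, np.nan, np.nan]})   # 75% missing

missing_rate = df.isna().mean() * 100        # percentage of NaNs per column
to_drop = missing_rate[missing_rate > 80].index
print(list(to_drop))  # ['b']
df = df.drop(columns=to_drop)
```

On the real tables this would flag occutc, v2, lartpc and larrout, the same columns dropped manually in the cleaning cell below.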

Visualizing datasets dtypes 1 by 1

In [5]:
characteristics_dtypes = characteristics.dtypes.rename_axis('features').reset_index(name='dtype')
vehicles_dtypes = vehicles.dtypes.rename_axis('features').reset_index(name='dtype')
places_dtypes = places.dtypes.rename_axis('features').reset_index(name='dtype')
drivers_dtypes = drivers.dtypes.rename_axis('features').reset_index(name='dtype')


caption_style = [{
    'selector': 'caption',
    'props': [
        ('color', '#585858'),
        ('font-size', '30px')
    ]
}]

def style_caption(df):
    # Same caption styling for each of the four dtype tables
    return df.style.set_table_styles(caption_style)

characteristics_dtypes = style_caption(characteristics_dtypes)
vehicles_dtypes = style_caption(vehicles_dtypes)
places_dtypes = style_caption(places_dtypes)
drivers_dtypes = style_caption(drivers_dtypes)


characteristics_dtypes_styler = characteristics_dtypes.set_table_attributes("style='display:inline'").set_caption('characteristics')
vehicles_dtypes_styler = vehicles_dtypes.set_table_attributes("style='display:inline'").set_caption('vehicles')
places_dtypes_styler = places_dtypes.set_table_attributes("style='display:inline'").set_caption('places')
drivers_dtypes_styler = drivers_dtypes.set_table_attributes("style='display:inline'").set_caption('drivers')
space = "\xa0" * 50
display_html(characteristics_dtypes_styler._repr_html_() + space + vehicles_dtypes_styler._repr_html_() + space + 
             places_dtypes_styler._repr_html_() + space + drivers_dtypes_styler._repr_html_(), raw=True)
characteristics
features dtype
0 jour float64
1 mois float64
2 an float64
3 hrmn float64
4 lum float64
5 dep object
6 com object
7 agg float64
8 int float64
9 atm float64
10 col float64
11 adr object
12 lat float64
13 long float64
                                                  
vehicles
features dtype
0 Num_Acc int64
1 num_veh object
2 senc int64
3 catv int64
4 obs int64
5 obsm int64
6 choc int64
7 manv int64
8 motor int64
9 occutc float64
                                                  
places
features dtype
0 catr float64
1 voie object
2 v1 float64
3 v2 object
4 circ float64
5 nbv float64
6 vosp float64
7 prof float64
8 pr float64
9 pr1 float64
10 plan float64
11 lartpc float64
12 larrout float64
13 surf float64
14 infra float64
15 situ float64
16 vma float64
                                                  
drivers
features dtype
0 Num_Acc int64
1 num_veh object
2 place int64
3 catu int64
4 grav int64
5 sexe int64
6 an_nais int64
7 trajet int64
8 secu1 int64
9 secu2 int64
10 secu3 int64
11 locp int64
12 actp object
13 etatp int64

characteristics

In [6]:
for col in characteristics.select_dtypes("object"):
    print('\n')
    print('Number of unique values in "', col, '":', characteristics[col].nunique())
    print(characteristics[col].unique())
    print('\n')
    print('------------------------------------------------')

Number of unique values in " dep ": 107
['93' '92' '94' '87' '69' '38' '34' '13' '988' '976' '974' '972' '2B' '91'
 '86' '83' '80' '78' '77' '76' '72' '71' '67' '66' '64' '60' '51' '50'
 '49' '45' '37' '35' '33' '31' '30' '29' '22' '19' '18' '17' '74' '81' '2'
 '59' '95' '63' '62' '973' '2A' '84' '9' '73' '43' '10' '36' '16' '7' '21'
 '40' '24' '4' '85' '27' '28' '52' '68' '42' '82' '11' '987' '44' '61'
 '14' '56' '58' '54' '47' '41' nan '3' '75' '1' '57' '32' '39' '15' '23'
 '6' '5' '26' '48' '986' '971' '89' '25' '12' '88' '65' '53' '70' '46'
 '90' '8' '79' '977' '55' '978' '975']


------------------------------------------------


Number of unique values in " com ": 11421
['93053' '93066' '92036' ... '67473' '85099' '76462']


------------------------------------------------


Number of unique values in " adr ": 31934
['AUTOROUTE A3' 'AUTOROUTE A1' 'AUTOROUTE A86' ... 'Route de Castelnau'
 'Route de Nieul-le-Dolent' "Boulevard l'Alouette"]


------------------------------------------------
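Note that dep mixes numeric strings with Corsica's '2A'/'2B' and 3-digit overseas codes, so a bare astype(float) would fail. One option, the same idea as the commented-out filters in the cleaning cell further down, is keeping metropolitan departments only (toy values here):

```python
import pandas as pd

dep = pd.Series(['93', '2A', '2B', '971', '75'])    # toy department codes
metro = dep[~dep.isin(['2A', '2B'])].astype(float)  # Corsican codes are not numeric
metro = metro[metro < 100]                          # 3-digit codes are overseas
print(metro.tolist())  # [93.0, 75.0]
```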
In [7]:
sns.set(font_scale = 1.5)
plt.figure(figsize=(10, 30))
plt.title('Number of accidents in 2019 per Department')
sns.countplot(y=characteristics['dep'])
plt.xlabel("Number of accidents")
plt.ylabel("Department")
plt.show()
In [8]:
sns.set(font_scale = 1.5)
fig, ax = plt.subplots(3,4, figsize=(30, 15))
i=0
for col in characteristics.select_dtypes(include=['float64','int64']):
    sns.histplot(characteristics[col], kde=True, label=col, ax=ax[i//4][i%4])  # distplot is deprecated in recent seaborn
    i = i + 1
fig.show()

Comments

  • On the first graph (accidents by department), department 75 (Paris) stands out far above the others
  • On the distribution plots, the feature "an" (year) has no variance: the whole dataset covers 2019 only, so this is expected. We will drop this feature later.
  • The features "lat" and "long" (latitude and longitude of the accident) are not scaled: they are stored with a factor of 1e7, to be corrected

vehicles

In [9]:
for col in vehicles.select_dtypes("object"):
    print('\n')
    print('Number of unique values in "', col, '":', vehicles[col].nunique())
    print(vehicles[col].unique())
    print('\n')
    print('------------------------------------------------')

Number of unique values in " num_veh ": 48
['B01' 'A01' 'C01' 'Z01' 'D01' 'E01' 'AB01' 'Y01' 'I01' 'T01' 'O01' 'G01'
 'F01' 'PB01' 'FB01' 'M01' 'LB01' 'H01' 'J01' 'K01' 'L01' 'CB01' 'X01'
 'N01' 'W01' 'U01' 'V01' 'MB01' 'RA01' 'TC01' 'R01' 'Q01' 'GB01' 'MA01'
 'VB01' 'RC01' 'BA01' 'TB01' '[01' '\\01' 'VF01' 'ZZ01' 'P01' 'DA01'
 'AA01' 'BB01' 'ZB01' 'BC01']


------------------------------------------------
In [10]:
sns.set(font_scale = 1.5)
plt.figure(figsize=(10, 15))
plt.title('Number of vehicles in 2019 accidents per vehicle identifier (num_veh)')
sns.countplot(y=vehicles['num_veh'])
plt.xlabel("Number of accidents")
plt.ylabel("Category")
plt.show()
In [11]:
sns.set(font_scale = 1.5)
fig, ax = plt.subplots(3,3, figsize=(30, 10))
i=0
for col in vehicles.select_dtypes(include=['float64','int64']):
    sns.histplot(vehicles[col], kde=True, label=col, ax=ax[i//3][i%3])  # distplot is deprecated in recent seaborn
    i = i + 1
fig.show()

Comments

  • We can notice that the feature "obs" has a low variance

places

In [12]:
for col in places.select_dtypes("object"):
    print('\n')
    print('Number of unique values in "', col, '":', places[col].nunique())
    print(places[col].unique())
    print('\n')
    print('------------------------------------------------')

Number of unique values in " voie ": 14327
['3' '1' '86' ... 'PRESIDENT ROOSEVELT (RUE DU) N°2 A 26' 'VC du Rocher'
 'Gabriel Péri (AV)']


------------------------------------------------


Number of unique values in " v2 ": 35
[nan 'INT' 'B' 'D' ' -' 'A' 'N' 'R' 'E' 'F' 'C' 'EXT' 'Z' 'I' 'X' 'W' 'Y'
 'b' 'G' 'V' 'U' 'L' 'M' 'H' 'T' 'S' 'P' 'O' '15' 'K' 'II' '1A' 'IV' ' D'
 'A1' 'CD']


------------------------------------------------
In [13]:
sns.set(font_scale = 1.5)
fig, ax = plt.subplots(3,5, figsize=(30, 15))
i=0
for col in places.select_dtypes(include=['float64','int64']):
    sns.histplot(places[col], kde=True, label=col, ax=ax[i//5][i%5])  # distplot is deprecated in recent seaborn
    i = i + 1
fig.show()

Comments

  • We can notice that the features "v1", "vosp" and "pr" have a low variance

drivers

In [14]:
for col in drivers.select_dtypes("object"):
    print('\n')
    print('Number of unique values in "', col, '":', drivers[col].nunique())
    print(drivers[col].unique())
    print('\n')
    print('------------------------------------------------')

Number of unique values in " num_veh ": 29
['B01' 'A01' 'C01' 'D01' 'E01' 'Z01' 'Y01' 'I01' 'T01' 'O01' 'G01' 'F01'
 'M01' 'LB01' 'H01' 'J01' 'K01' 'L01' 'N01' 'W01' 'X01' 'U01' 'V01' 'Q01'
 'MA01' 'CB01' '\\01' 'VF01' 'P01']


------------------------------------------------


Number of unique values in " actp ": 13
['-1' '0' '3' '2' '1' 'B' '4' '9' '5' 'A' '8' '6' '7']


------------------------------------------------
In [15]:
sns.set(font_scale = 1.5)
plt.figure(figsize=(10, 5))
plt.title('Number of accidents in 2019 sorted by pedestrian actions')
sns.countplot(y=drivers['actp'])
plt.xlabel("Number of accidents")
plt.ylabel("Category")
plt.show()
In [16]:
sns.set(font_scale = 1.5)
fig, ax = plt.subplots(3,4, figsize=(30, 15))
i=0
for col in drivers.select_dtypes(include=['float64','int64']):
    sns.histplot(drivers[col], kde=True, label=col, ax=ax[i//4][i%4])  # distplot is deprecated in recent seaborn
    i = i + 1
fig.show()

Comments

  • We can notice that the feature "secu3" has a low variance

Summary¶

characteristics¶

  • "dep": department 75 (Paris) stands out far above the others
  • "an" (year) has no variance: the whole dataset covers 2019 only, so this is expected. We will drop this feature.
  • "lat" and "long" (latitude and longitude of the accident) are not scaled: they are stored with a factor of 1e7, to be corrected

vehicles¶

  • "obs" has a low variance

places¶

  • "v1", "vosp" and "pr" have a low variance

drivers¶

  • "secu3" has a low variance

Cleaning data¶

In [17]:
characteristics = pd.read_csv('../input/2019-database-of-road-traffic-injuries/caracteristiques-2019.csv')
characteristics.name = 'characteristics'
places = pd.read_csv('../input/2019-database-of-road-traffic-injuries/lieux-2019.csv')
places.name = 'places'
drivers = pd.read_csv('../input/2019-database-of-road-traffic-injuries/usagers-2019.csv')
drivers.name = 'drivers'
vehicles = pd.read_csv('../input/2019-database-of-road-traffic-injuries/vehicules-2019.csv')
vehicles.name = 'vehicles'

datasets = [characteristics,places,vehicles,drivers]

# Indexing the tables
characteristics = characteristics.set_index('Num_Acc')
places = places.set_index('Num_Acc')
vehicles = vehicles.set_index('id_vehicule')
drivers = drivers.set_index('id_vehicule')

# Dealing with features with too many NaNs 
vehicles = vehicles.drop('occutc',axis=1)
places = places.drop(['v2','lartpc','larrout'],axis=1)

# Dealing with features according to the EDA
characteristics = characteristics.drop(['an','adr','com'],axis=1)
characteristics['lat'] = characteristics['lat'] / 1e7   # undo the 1e7 storage factor
characteristics['long'] = characteristics['long'] / 1e7
characteristics = characteristics.drop('201900033874',axis=0)
#characteristics = characteristics[characteristics['dep']!='2B'] # comment / uncomment
#characteristics = characteristics[characteristics['dep']!='2A'] # comment / uncomment
#characteristics = characteristics[(characteristics['dep'].astype(float)<100)] # comment / uncomment
#places = places.loc[characteristics.index.values] # comment / uncomment
places = places.drop('201900033874',axis=0) # comment / uncomment
places = places.drop(['v1','vosp','pr','voie'],axis=1)
vehicles = vehicles.drop('obs',axis=1)
drivers = drivers.drop(['secu3'],axis=1)

pd.set_option('display.max_rows',max(characteristics.shape[0],places.shape[0],drivers.shape[0],vehicles.shape[0]))
pd.set_option('display.max_columns',max(characteristics.shape[1],places.shape[1],drivers.shape[1],vehicles.shape[1]))

# Refresh the list: the names above still pointed at the raw, unmodified frames
datasets = [characteristics, places, vehicles, drivers]
names = ['characteristics', 'places', 'vehicles', 'drivers']
for name, df in zip(names, datasets):
    print(f"The dataset {name} has {df.shape[0]} rows and {df.shape[1]} columns")
The dataset characteristics has 58839 rows and 11 columns
The dataset places has 58839 rows and 10 columns
The dataset vehicles has 100710 rows and 8 columns
The dataset drivers has 132977 rows and 13 columns
In [18]:
from sklearn.preprocessing import LabelEncoder
def encoding(df):
    label = LabelEncoder()
    for c in df.select_dtypes("object"):
        df[c] = df[c].astype("|S")          # cast to bytes so NaN becomes an ordinary value
        df[c] = label.fit_transform(df[c])  # a fresh mapping is fitted for each column
    return df

def imputation(df):
    df = df.fillna(df.median())
    df = df.dropna()
    return df

def preprocessing(df):
    df = encoding(df)
    df = imputation(df) 

    return df
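One limitation of the encoding helper above: it refits a single LabelEncoder per column and then discards it, so the integer codes cannot be mapped back to the original labels. A variant (my own sketch, not part of the notebook) keeps one fitted encoder per column so codes can be inverted later:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encoding_with_maps(df):
    """Encode object columns in place and return the fitted encoders."""
    encoders = {}
    for c in df.select_dtypes('object'):
        le = LabelEncoder()
        df[c] = le.fit_transform(df[c].astype(str))  # str cast turns NaN into a plain label
        encoders[c] = le
    return df, encoders

toy = pd.DataFrame({'num_veh': ['B01', 'A01', 'B01']})  # toy data
toy, enc = encoding_with_maps(toy)
print(toy['num_veh'].tolist())                          # [1, 0, 1]
print(list(enc['num_veh'].inverse_transform([0, 1])))   # ['A01', 'B01']
```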
In [19]:
characteristics = preprocessing(characteristics)
vehicles = preprocessing(vehicles)
places = preprocessing(places)
drivers = preprocessing(drivers)
In [20]:
display(HTML('<h1>characteristics</h1>'))
display(characteristics.head())
display(HTML('<h1>vehicles</h1>'))
display(vehicles.head())
display(HTML('<h1>drivers</h1>'))
display(drivers.head())
display(HTML('<h1>places</h1>'))
display(places.head())

characteristics

jour mois hrmn lum dep agg int atm col lat long
Num_Acc
201900000001 30.0 11.0 0.062500 4.0 93 1.0 1.0 1.0 2.0 48.896210 2.470120
201900000002 30.0 11.0 0.118056 3.0 93 1.0 1.0 1.0 6.0 48.930700 2.368800
201900000003 28.0 11.0 0.635417 1.0 92 1.0 1.0 1.0 4.0 48.935872 2.319174
201900000004 30.0 11.0 0.847222 5.0 94 1.0 1.0 1.0 4.0 48.817329 2.428150
201900000005 30.0 11.0 0.166667 3.0 94 1.0 1.0 1.0 2.0 48.776362 2.433254

vehicles

Num_Acc num_veh senc catv obsm choc manv motor
id_vehicule
138 306 524 201900000001 3 2 7 2 5 23 1
138 306 525 201900000001 0 2 17 0 3 11 1
138 306 523 201900000002 0 1 7 0 1 0 1
138 306 520 201900000003 0 1 7 2 1 2 1
138 306 521 201900000003 3 1 7 0 4 2 1

drivers

Num_Acc num_veh place catu grav sexe an_nais trajet secu1 secu2 locp actp etatp
id_vehicule
138 306 524 201900000001 1 2 2 4 2 2002 0 1 0 -1 0 -1
138 306 524 201900000001 1 1 1 4 2 1993 5 1 0 -1 0 -1
138 306 525 201900000001 0 1 1 1 1 1959 0 1 0 -1 0 -1
138 306 523 201900000002 0 1 1 4 2 1994 0 1 0 -1 0 -1
138 306 520 201900000003 0 1 1 1 1 1996 0 1 0 -1 1 -1

places

catr circ nbv prof pr1 plan surf infra situ vma
Num_Acc
201900000001 1.0 3.0 10.0 1.0 900.0 2.0 1.0 2.0 1.0 70.0
201900000002 1.0 1.0 2.0 4.0 845.0 2.0 1.0 0.0 1.0 70.0
201900000003 1.0 3.0 8.0 1.0 500.0 3.0 1.0 0.0 1.0 90.0
201900000004 1.0 3.0 5.0 1.0 299.0 1.0 1.0 0.0 1.0 90.0
201900000005 1.0 1.0 3.0 1.0 0.0 3.0 1.0 2.0 1.0 90.0

Display accidents on a map¶

(Zoom out to see France's overseas territories)¶

In [21]:
lat = characteristics['lat']
lon = characteristics['long']
dep = characteristics['dep']
catr = places['catr'].map({1 : 'Highway',
2 : 'National road',
3 : 'Departmental road',
4 : 'Communal road',
5 : 'Outside the public network',
6 : 'Parking lot open to public traffic',
7 : 'Urban metropolis roads',
9 : 'Other'})
lum = characteristics['lum'].map({1 : 'Full day',
2 : 'Twilight or dawn',
3 : 'Night without public lighting',
4 : 'Night with public lighting off',
5 : 'Night with public lighting on'})
atm = characteristics['atm'].map({1 : 'Normal',
2 : 'Light rain',
3 : 'Heavy rain',
4 : 'Snow, hail',
5 : 'Fog, smoke',
6 : 'Strong wind, storm',
7 : 'Dazzling weather',
8 : 'Cloudy weather',
9 : 'Other'})
col = characteristics['col'].map({1 : 'Two vehicles, head-on',
2 : 'Two vehicles, from the rear',
3 : 'Two vehicles, from the side',
4 : 'Three or more vehicles, in a chain',
5 : 'Three or more vehicles, multiple collisions',
6 : 'Other collision',
7 : 'No collision'})
circ = places['circ'].map({1 : 'One way',
2 : 'Bidirectional',
3 : 'Separated carriageways',
4 : 'With variable-direction lanes'})
prof = places['prof'].map({1 : 'Flat',
2 : 'Slope',
3 : 'Hilltop',
4 : 'Bottom of a slope'})
plan = places['plan'].map({1 : 'Straight section',
2 : 'Curve to the left',
3 : 'Curve to the right',
4 : 'In an "S"'})
surf = places['surf'].map({1 : 'Normal',
2 : 'Wet',
3 : 'Puddles',
4 : 'Flooded',
5 : 'Snowy',
6 : 'Mud',
7 : 'Icy',
8 : 'Grease, oil',
9 : 'Other'})
vma = places['vma']
In [22]:
import plotly.express as px
fig = px.scatter_mapbox(characteristics, 
                        lat="lat", 
                        lon="long", 
                        hover_name=catr, 
                        hover_data={'Light':lum,
                                    'Atmosphere':atm,
                                    'Collision':col,
                                    'Regime':circ,
                                    'Profile':prof,
                                    'Layout':plan,
                                    'Surface':surf,
                                    'Speed':vma,
                                    'long':False,
                                    'lat':False}, 
                        zoom=4.9, 
                        height=800, 
                        width=800)
fig.data[0]['marker'].update(color='red')
fig.data[0]['marker'].update(size=3)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Concatenating datasets¶

In [23]:
df_acc = pd.concat([characteristics, places.reindex(characteristics.index)], axis=1)
df_veh = pd.concat([drivers, vehicles.reindex(drivers.index)], axis=1)
pd.set_option('display.max_row',40)
pd.set_option('display.max_column',40)
display(HTML('<h1>df_acc</h1>'))
display(df_acc.head())
display(HTML('<h1>df_veh</h1>'))
display(df_veh.head())

df_acc

jour mois hrmn lum dep agg int atm col lat long catr circ nbv prof pr1 plan surf infra situ vma
Num_Acc
201900000001 30.0 11.0 0.062500 4.0 93 1.0 1.0 1.0 2.0 48.896210 2.470120 1.0 3.0 10.0 1.0 900.0 2.0 1.0 2.0 1.0 70.0
201900000002 30.0 11.0 0.118056 3.0 93 1.0 1.0 1.0 6.0 48.930700 2.368800 1.0 1.0 2.0 4.0 845.0 2.0 1.0 0.0 1.0 70.0
201900000003 28.0 11.0 0.635417 1.0 92 1.0 1.0 1.0 4.0 48.935872 2.319174 1.0 3.0 8.0 1.0 500.0 3.0 1.0 0.0 1.0 90.0
201900000004 30.0 11.0 0.847222 5.0 94 1.0 1.0 1.0 4.0 48.817329 2.428150 1.0 3.0 5.0 1.0 299.0 1.0 1.0 0.0 1.0 90.0
201900000005 30.0 11.0 0.166667 3.0 94 1.0 1.0 1.0 2.0 48.776362 2.433254 1.0 1.0 3.0 1.0 0.0 3.0 1.0 2.0 1.0 90.0

df_veh

Num_Acc num_veh place catu grav sexe an_nais trajet secu1 secu2 locp actp etatp Num_Acc num_veh senc catv obsm choc manv motor
id_vehicule
138 306 524 201900000001 1 2 2 4 2 2002 0 1 0 -1 0 -1 201900000001 3 2 7 2 5 23 1
138 306 524 201900000001 1 1 1 4 2 1993 5 1 0 -1 0 -1 201900000001 3 2 7 2 5 23 1
138 306 525 201900000001 0 1 1 1 1 1959 0 1 0 -1 0 -1 201900000001 0 2 17 0 3 11 1
138 306 523 201900000002 0 1 1 4 2 1994 0 1 0 -1 0 -1 201900000002 0 1 7 0 1 0 1
138 306 520 201900000003 0 1 1 1 1 1996 0 1 0 -1 1 -1 201900000003 0 1 7 2 1 2 1
In [24]:
df_acc = df_acc.loc[:,~df_acc.columns.duplicated()] # Keep only the first occurrence of duplicated columns
df_veh = df_veh.loc[:,~df_veh.columns.duplicated()]
df_veh.reset_index(drop=True, inplace=True)
df_veh.index = df_veh['Num_Acc'].astype('str')
df_veh = df_veh.drop(['Num_Acc'],axis=1)
df = pd.concat([df_acc.reindex(df_veh.index),df_veh],axis=1)
df = preprocessing(df)
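The `~columns.duplicated()` idiom above is worth isolating: an axis-1 `pd.concat` keeps every column from both frames, so shared key columns appear twice. A minimal sketch (the toy frames `a` and `b` and their columns are illustrative):

```python
import pandas as pd

# pd.concat(axis=1) keeps all columns from both frames, so the shared
# key column appears twice; ~columns.duplicated() keeps the first copy
a = pd.DataFrame({'Num_Acc': [1, 2], 'x': [10, 20]})
b = pd.DataFrame({'Num_Acc': [1, 2], 'y': [30, 40]})
both = pd.concat([a, b], axis=1)               # columns: Num_Acc, x, Num_Acc, y
deduped = both.loc[:, ~both.columns.duplicated()]
print(list(deduped.columns))                   # → ['Num_Acc', 'x', 'y']
```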
In [25]:
pd.set_option('display.max_row',40)
pd.set_option('display.max_column',40)
display(HTML('<h1>df_acc</h1>'))
display(df_acc.head())
display(HTML('<h1>df_veh</h1>'))
display(df_veh.head())
display(HTML('<h1>Fully concatenated dataset : df</h1>'))
display(df.head())
print('\n')
print("----------------------------------")
display(HTML('<h3>Complete dataset shape :</h3>'))
display(df.shape)
print("----------------------------------")

df_acc

jour mois hrmn lum dep agg int atm col lat long catr circ nbv prof pr1 plan surf infra situ vma
Num_Acc
201900000001 30.0 11.0 0.062500 4.0 93 1.0 1.0 1.0 2.0 48.896210 2.470120 1.0 3.0 10.0 1.0 900.0 2.0 1.0 2.0 1.0 70.0
201900000002 30.0 11.0 0.118056 3.0 93 1.0 1.0 1.0 6.0 48.930700 2.368800 1.0 1.0 2.0 4.0 845.0 2.0 1.0 0.0 1.0 70.0
201900000003 28.0 11.0 0.635417 1.0 92 1.0 1.0 1.0 4.0 48.935872 2.319174 1.0 3.0 8.0 1.0 500.0 3.0 1.0 0.0 1.0 90.0
201900000004 30.0 11.0 0.847222 5.0 94 1.0 1.0 1.0 4.0 48.817329 2.428150 1.0 3.0 5.0 1.0 299.0 1.0 1.0 0.0 1.0 90.0
201900000005 30.0 11.0 0.166667 3.0 94 1.0 1.0 1.0 2.0 48.776362 2.433254 1.0 1.0 3.0 1.0 0.0 3.0 1.0 2.0 1.0 90.0

df_veh

num_veh place catu grav sexe an_nais trajet secu1 secu2 locp actp etatp senc catv obsm choc manv motor
Num_Acc
201900000001 1 2 2 4 2 2002 0 1 0 -1 0 -1 2 7 2 5 23 1
201900000001 1 1 1 4 2 1993 5 1 0 -1 0 -1 2 7 2 5 23 1
201900000001 0 1 1 1 1 1959 0 1 0 -1 0 -1 2 17 0 3 11 1
201900000002 0 1 1 4 2 1994 0 1 0 -1 0 -1 1 7 0 1 0 1
201900000003 0 1 1 1 1 1996 0 1 0 -1 1 -1 1 7 2 1 2 1

Fully concatenated dataset : df

jour mois hrmn lum dep agg int atm col lat long catr circ nbv prof pr1 plan surf infra situ vma num_veh place catu grav sexe an_nais trajet secu1 secu2 locp actp etatp senc catv obsm choc manv motor
Num_Acc
201900000001 30.0 11.0 0.062500 4.0 93.0 1.0 1.0 1.0 2.0 48.896210 2.470120 1.0 3.0 10.0 1.0 900.0 2.0 1.0 2.0 1.0 70.0 1 2 2 4 2 2002 0 1 0 -1 0 -1 2 7 2 5 23 1
201900000001 30.0 11.0 0.062500 4.0 93.0 1.0 1.0 1.0 2.0 48.896210 2.470120 1.0 3.0 10.0 1.0 900.0 2.0 1.0 2.0 1.0 70.0 1 1 1 4 2 1993 5 1 0 -1 0 -1 2 7 2 5 23 1
201900000001 30.0 11.0 0.062500 4.0 93.0 1.0 1.0 1.0 2.0 48.896210 2.470120 1.0 3.0 10.0 1.0 900.0 2.0 1.0 2.0 1.0 70.0 0 1 1 1 1 1959 0 1 0 -1 0 -1 2 17 0 3 11 1
201900000002 30.0 11.0 0.118056 3.0 93.0 1.0 1.0 1.0 6.0 48.930700 2.368800 1.0 1.0 2.0 4.0 845.0 2.0 1.0 0.0 1.0 70.0 0 1 1 4 2 1994 0 1 0 -1 0 -1 1 7 0 1 0 1
201900000003 28.0 11.0 0.635417 1.0 92.0 1.0 1.0 1.0 4.0 48.935872 2.319174 1.0 3.0 8.0 1.0 500.0 3.0 1.0 0.0 1.0 90.0 0 1 1 1 1 1996 0 1 0 -1 1 -1 1 7 2 1 2 1

----------------------------------

Complete dataset shape :

(132977, 39)
----------------------------------

Target visualization¶

In [26]:
plt.figure(figsize=(15,6))
sns.countplot(df['grav'].map({1:'Unharmed',
                                  2:'Killed',
                                  3:'Injured hospitalized',
                                  4:'Slightly injured'
                                }))
plt.plot()
Out[26]:
[]
[Image: countplot of the injury severity classes]

Resampling (if needed)¶

A widely adopted technique for dealing with highly unbalanced datasets is resampling: removing samples from the majority class (under-sampling) and/or adding copies of samples from the minority class (over-sampling).

Despite the advantage of balancing the classes, these techniques have weaknesses (there is no free lunch). The simplest over-sampling approach duplicates random records from the minority class, which can cause overfitting; the simplest under-sampling approach removes random records from the majority class, which can cause loss of information. Note too that over-sampling is ideally applied to the training set only, after the train/test split, so that duplicated records cannot land in both sets and inflate test scores.

Let's implement a basic example, which uses the DataFrame.sample method to get random samples from each class.

In [27]:
# Class counts: value_counts() is sorted by descending frequency,
# so class 4 is the most frequent and class 2 (killed) the rarest
count_class_4, count_class_1, count_class_3, count_class_2 = df['grav'].value_counts()

# Divide by class
df_class_1 = df[df['grav'] == 1]
df_class_2 = df[df['grav'] == 2]
df_class_3 = df[df['grav'] == 3]
df_class_4 = df[df['grav'] == 4]

df_class_1_under = df_class_1.sample(count_class_2,random_state=42)
df_class_4_under = df_class_4.sample(count_class_2,random_state=42)
df_class_3_under = df_class_3.sample(count_class_2,random_state=42)
df_under = pd.concat([df_class_1_under, df_class_2, df_class_3_under, df_class_4_under], axis=0)

df_class_2_over = df_class_2.sample(count_class_1, replace=True, random_state=42)
df_class_3_over = df_class_3.sample(count_class_1, replace=True, random_state=42)
df_class_4_over = df_class_4.sample(count_class_1, replace=True, random_state=42)
df_over = pd.concat([df_class_1, df_class_2_over, df_class_3_over, df_class_4_over], axis=0)

fig,axes = plt.subplots(1,2,figsize=(20,6),sharey=True)
sns.countplot(ax=axes[0],x=df_under['grav'].map({1:'Unharmed',
                                  2:'Killed',
                                  3:'Injured hospitalized',
                                  4:'Slightly injured'
                                }))
axes[0].set_title('Random Downsampling')
sns.countplot(ax=axes[1],x=df_over['grav'].map({1:'Unharmed',
                                  2:'Killed',
                                  3:'Injured hospitalized',
                                  4:'Slightly injured'
                                }))
axes[1].set_title('Random Oversampling')
plt.plot()
Out[27]:
[]
[Image: countplots of class balance after random downsampling and random oversampling]
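The hand-rolled sampling above can also be expressed with scikit-learn's `sklearn.utils.resample` helper; a minimal two-class sketch (the toy frame `df_toy` and its `feature`/`label` columns are illustrative):

```python
import pandas as pd
from sklearn.utils import resample

# Toy unbalanced frame: 8 majority rows (label 0) vs 2 minority rows (label 1)
df_toy = pd.DataFrame({'feature': range(10),
                       'label':   [0] * 8 + [1] * 2})
majority = df_toy[df_toy['label'] == 0]
minority = df_toy[df_toy['label'] == 1]

# Draw minority rows with replacement until they match the majority count
# (equivalent to DataFrame.sample(n, replace=True) above)
minority_over = resample(minority, replace=True,
                         n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_over])
print(balanced['label'].value_counts().to_dict())   # → {0: 8, 1: 8}
```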

Modelling¶

In [28]:
trainset, testset = train_test_split(df_over, test_size=0.15, random_state=42)
fig, ax = plt.subplots(1,2, figsize=(18, 5))
sns.countplot(x = 'grav' , data = trainset,ax=ax[0],palette="Accent").set_title('TrainSet')
sns.countplot(x = 'grav' , data = testset,ax=ax[1],palette="Accent").set_title('TestSet')
Out[28]:
Text(0.5, 1.0, 'TestSet')
[Image: countplots of 'grav' in the train and test sets]
In [29]:
X_train = trainset.drop('grav',axis=1)
y_train = trainset['grav']
X_test = testset.drop('grav',axis=1)
y_test = testset['grav']
In [30]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
In [31]:
preprocessor = make_pipeline(StandardScaler())

PCAPipeline = make_pipeline(preprocessor, PCA(n_components=3,random_state=0))

RandomPipeline = make_pipeline(preprocessor,RandomForestClassifier(random_state=0))
AdaPipeline = make_pipeline(preprocessor,AdaBoostClassifier(random_state=0))
SVMPipeline = make_pipeline(preprocessor,SVC(random_state=0,probability=True))
KNNPipeline = make_pipeline(preprocessor,KNeighborsClassifier())
LRPipeline = make_pipeline(preprocessor,LogisticRegression(solver='sag'))

PCA Analysis¶

In [32]:
PCA_df = pd.DataFrame(PCAPipeline.fit_transform(df.drop('grav',axis=1)))
PCA_df = pd.concat([PCA_df.reset_index(), df['grav'].map({1:'Unharmed',
                                  2:'Killed',
                                  3:'Injured hospitalized',
                                  4:'Slightly injured'
                                }).reset_index()], axis=1)
PCA_df = PCA_df.drop(['index','Num_Acc'],axis=1)
PCA_df.head()
Out[32]:
0 1 2 grav
0 -2.026957 3.645923 -2.153101 Slightly injured
1 -2.769672 3.458293 -1.775232 Slightly injured
2 -2.743502 3.446155 -1.023593 Unharmed
3 -1.418706 2.277790 2.955279 Slightly injured
4 -2.403853 3.466685 -1.137749 Unharmed
In [33]:
figure1 = px.scatter_3d(PCA_df,
        x=0, 
        y=1, 
        z=2, 
        color = 'grav',
                       width=600, height=800)
figure1.update_traces(marker=dict(size=5,
                                  line=dict(width=0.15,
                                        color='black')),
                      selector=dict(mode='markers'))

figure1.show()

Training models¶

Models overview¶

In [34]:
dict_of_models = {'KNN': KNNPipeline,
                  'RandomForest': RandomPipeline,
                  'AdaBoost': AdaPipeline,
                  #'SVM': SVMPipeline,
                  'LR': LRPipeline}
In [35]:
def evaluation(model):
    model.fit(X_train, y_train)
    # calculating the predictions
    y_pred = model.predict(X_test)
    print('Accuracy = ', accuracy_score(y_test, y_pred))
    print('-')
    print(confusion_matrix(y_test,y_pred))
    print('-')
    print(classification_report(y_test,y_pred))
    print('-')
In [36]:
for name, model in dict_of_models.items():
    print('---------------------------------')
    print(name)
    evaluation(model)
---------------------------------
KNN
Accuracy =  0.72331660781763
-
[[5366  309 1050 1510]
 [   0 7957    0    0]
 [ 857  528 5863  771]
 [2056  373 1479 4167]]
-
              precision    recall  f1-score   support

           1       0.65      0.65      0.65      8235
           2       0.87      1.00      0.93      7957
           3       0.70      0.73      0.71      8019
           4       0.65      0.52      0.57      8075

    accuracy                           0.72     32286
   macro avg       0.72      0.72      0.72     32286
weighted avg       0.71      0.72      0.72     32286

-
---------------------------------
RandomForest
Accuracy =  0.9063680852381837
-
[[7135   13  412  675]
 [   0 7957    0    0]
 [ 120   17 7676  206]
 [1032   20  528 6495]]
-
              precision    recall  f1-score   support

           1       0.86      0.87      0.86      8235
           2       0.99      1.00      1.00      7957
           3       0.89      0.96      0.92      8019
           4       0.88      0.80      0.84      8075

    accuracy                           0.91     32286
   macro avg       0.91      0.91      0.91     32286
weighted avg       0.91      0.91      0.91     32286

-
---------------------------------
AdaBoost
Accuracy =  0.5417828160812737
-
[[6161  561  487 1026]
 [ 597 4925 1697  738]
 [ 885 2706 2672 1756]
 [2369  798 1174 3734]]
-
              precision    recall  f1-score   support

           1       0.62      0.75      0.68      8235
           2       0.55      0.62      0.58      7957
           3       0.44      0.33      0.38      8019
           4       0.51      0.46      0.49      8075

    accuracy                           0.54     32286
   macro avg       0.53      0.54      0.53     32286
weighted avg       0.53      0.54      0.53     32286

-
---------------------------------
LR
Accuracy =  0.4776373660410085
-
[[5330 1032  646 1227]
 [ 941 4729 1478  809]
 [1372 2774 2055 1818]
 [2492 1094 1182 3307]]
-
              precision    recall  f1-score   support

           1       0.53      0.65      0.58      8235
           2       0.49      0.59      0.54      7957
           3       0.38      0.26      0.31      8019
           4       0.46      0.41      0.43      8075

    accuracy                           0.48     32286
   macro avg       0.47      0.48      0.46     32286
weighted avg       0.47      0.48      0.47     32286

-

Using RandomForest¶

In [37]:
from sklearn.model_selection import RandomizedSearchCV
RandomPipeline.get_params().keys()
Out[37]:
dict_keys(['memory', 'steps', 'verbose', 'pipeline', 'randomforestclassifier', 'pipeline__memory', 'pipeline__steps', 'pipeline__verbose', 'pipeline__standardscaler', 'pipeline__standardscaler__copy', 'pipeline__standardscaler__with_mean', 'pipeline__standardscaler__with_std', 'randomforestclassifier__bootstrap', 'randomforestclassifier__ccp_alpha', 'randomforestclassifier__class_weight', 'randomforestclassifier__criterion', 'randomforestclassifier__max_depth', 'randomforestclassifier__max_features', 'randomforestclassifier__max_leaf_nodes', 'randomforestclassifier__max_samples', 'randomforestclassifier__min_impurity_decrease', 'randomforestclassifier__min_impurity_split', 'randomforestclassifier__min_samples_leaf', 'randomforestclassifier__min_samples_split', 'randomforestclassifier__min_weight_fraction_leaf', 'randomforestclassifier__n_estimators', 'randomforestclassifier__n_jobs', 'randomforestclassifier__oob_score', 'randomforestclassifier__random_state', 'randomforestclassifier__verbose', 'randomforestclassifier__warm_start'])
In [38]:
hyper_params = {
    'randomforestclassifier__n_estimators':[10,100,150,250,400,600],
    'randomforestclassifier__criterion':['gini','entropy'],
    'randomforestclassifier__min_samples_split':[2,6,12],
    'randomforestclassifier__min_samples_leaf':[1,4,6,10],
    'randomforestclassifier__max_features':['auto','sqrt','log2'],
    'randomforestclassifier__verbose':[0,1,2],
    'randomforestclassifier__class_weight':['balanced','balanced_subsample'],
    'randomforestclassifier__n_jobs':[-1],
}
In [ ]:
RF_grid = RandomizedSearchCV(RandomPipeline,hyper_params,scoring='accuracy',n_iter=40)
RF_grid.fit(X_train,y_train)
In [40]:
print(RF_grid.best_params_)
{'randomforestclassifier__verbose': 2, 'randomforestclassifier__n_jobs': -1, 'randomforestclassifier__n_estimators': 250, 'randomforestclassifier__min_samples_split': 6, 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__max_features': 'auto', 'randomforestclassifier__criterion': 'entropy', 'randomforestclassifier__class_weight': 'balanced'}
In [ ]:
best_forest = (RF_grid.best_estimator_)
best_forest.fit(X_train,y_train)
# calculating the predictions
y_pred = best_forest.predict(X_test)

N, train_score, test_score = learning_curve(best_forest, X_train, y_train, 
                                           cv=4, scoring='accuracy', 
                                           train_sizes=np.linspace(0.1,1,10))
In [42]:
print('Accuracy = ', accuracy_score(y_test, y_pred))
print('-')
print(confusion_matrix(y_test,y_pred))
print('-')
print(classification_report(y_test,y_pred))
print('-')
    
plt.figure(figsize=(5,5))
plt.plot(N, train_score.mean(axis=1), label='train score')
plt.plot(N, test_score.mean(axis=1), label='validation score')
plt.legend()
plt.title('Accuracy')
plt.show()
Accuracy =  0.9034566065787029
-
[[7159   13  414  649]
 [   0 7957    0    0]
 [ 153   14 7636  216]
 [1071   18  569 6417]]
-
              precision    recall  f1-score   support

           1       0.85      0.87      0.86      8235
           2       0.99      1.00      1.00      7957
           3       0.89      0.95      0.92      8019
           4       0.88      0.79      0.84      8075

    accuracy                           0.90     32286
   macro avg       0.90      0.90      0.90     32286
weighted avg       0.90      0.90      0.90     32286

-
[Image: learning curve, training vs validation accuracy]
In [43]:
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Binarize the output
y_train = label_binarize(y_train, classes=[1, 2, 3, 4])
y_test = label_binarize(y_test, classes=[1, 2, 3, 4])
n_classes = y_train.shape[1]

# Learn to predict each class against the other
classifier = OneVsRestClassifier(best_forest)
#y_score = classifier.fit(X_train, y_train).decision_function(X_test)
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# # Plot of a ROC curve for a specific class
# plt.figure()
# plt.plot(fpr[2], tpr[2], label='ROC curve (area = %0.2f)' % roc_auc[2])
# plt.plot([0, 1], [0, 1], 'k--')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('Receiver operating characteristic for class 2')
# plt.legend(loc="lower right")
# plt.show()

# Plot ROC curve
plt.figure(figsize=(15,10))
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]))
for i in range(n_classes):
    plt.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:0.2f})'
                                   ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   13.0s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:   55.4s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  1.5min finished
[Image: per-class ROC curves with micro-average]
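The micro-average computed above (`roc_curve` on `y_test.ravel()` and `y_score.ravel()`) treats every (sample, class) pair as one binary decision. A tiny hand-made check (the `y_true`/`y_score` values are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Micro-averaging flattens the one-hot truth matrix and the score matrix
# so every (sample, class) pair becomes one binary decision
y_true = np.array([[1, 0], [0, 1], [1, 0]])
y_score = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
fpr, tpr, _ = roc_curve(y_true.ravel(), y_score.ravel())
print(auc(fpr, tpr))   # scores separate the classes perfectly → 1.0
```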